Known Issues
date | category | content | status |
---|---|---|---|
2023/04/07 | Job | The error message 'configuration error-unknown item 'HOME_MODE' (notify administrator)' is output when a job is submitted from the interactive node (V). The message can be ignored; it does not affect job execution. | 2023/05/18 Close. The error message is no longer output. |
2023/01/31 | Application | The Intel oneAPI has been found to be vulnerable, so the affected commands (icpx, icpc) have been disabled. 2023/02/06 Removed the execute permission of the directory where the vulnerable Intel oneAPI was installed. | 2023/02/03 Close. Updated the Intel oneAPI to a fixed version. Please note that programs compiled with the previous version may contain vulnerabilities, so please recompile with the new version. Refer to System Updates for the version number. 2023/02/06 The intel/2022.0.2 and earlier Intel oneAPI modules containing vulnerabilities have been deprecated. Please use the intel/2022.2.1 module, which fixes the vulnerability. Programs compiled with the Intel oneAPI modules deprecated on Feb 6 may no longer run, so please recompile with the newer version. |
2022/12/23 | Application | We have confirmed that the cudnnConvolutionForward function fails when using cuDNN 8.7.0 with CUDA 10.2 on Compute Node (A). | We have confirmed that cuDNN 8.7.0 works with CUDA 11.x. Please use CUDA 11.x when using cuDNN 8.7.0 on Compute Node (A). |
2022/12/13 | Singularity Endpoint | After the maintenance on December 13, the Singularity Endpoint is not operational due to a failure in some of its features (pull and Remote Build). | 2023/01/05 Close. This issue has been resolved with a SingularityPRO update. |
2022/06/13 | Job | The following problems occurred between 2022/06/11 21:00 and 2022/06/13 09:48. • Newly submitted batch jobs on Compute Node (A)(V) failed to execute. • All reservations for Compute Node (A)(V) disappeared. Please resubmit your batch job(s) and recreate your reservation(s). | 2022/06/22 Close. |
2022/05/09 | FileSystem | When multiple threads issue fallocate system calls to the same file on the Lustre file system almost simultaneously, a deadlock may occur depending on the timing. This problem has been confirmed to cause the Home area to become inaccessible. | 2022/06/21 This issue has been resolved with a Lustre update. |
2022/04/06 | Singularity | A known issue has been identified: the remote build feature of the Singularity Endpoint is not available. As an alternative, please use the --fakeroot option to create container images (see the sketch after this table). Note that the Library and Keystore functions of the Singularity Endpoint are currently available. | 2022/04/14 Close. The failure of the remote build feature has been resolved. |
2022/04/06 | Job | Because of a job scheduler problem, we have confirmed that reservations made with the Reservation service disappear when the system is stopped. Please refrain from making reservations for the post-maintenance period until the issue is resolved. | 2022/06/21 This issue has been resolved with an Altair Grid Engine update. |
2022/01/21 | Application | A known issue has been identified: the execution of vtune using the intel-vtune/2020.3 module fails on Compute Node (A). | 2022/04/06 This issue has been resolved with an Intel VTune update. |
2021/12/17 | Application | A known issue has been identified: the execution of distributed deep learning using PyTorch and NCCL fails on Compute Node (A). To avoid this issue, set the following environment variable in your job script (see the sketch after this table): NCCL_IB_DISABLE=1 | 2022/03/03 Close. An update to OFED has resolved the issue. |
2021/10/19 | MPI | With Open MPI 3.1.6 on Compute Node (V), we have confirmed that when the -mca pml cm flag is specified for the mpirun command, processing stops and does not proceed in MPI_Send/MPI_Recv. | Open MPI 3 is no longer supported, so please use Open MPI 4. |
2021/07/06 | Singularity | The remote build function is not available due to a failure of the Remote Builder service. | 2021/07/21 Close. Resolved a communication problem in the Remote Builder service. |
2021/05/25 | GPU | A known issue has been identified: when the GPU is used repeatedly, processes may remain in state D or Z and GPU memory is not released. After this symptom occurs, subsequent processes on that GPU will not run normally because the GPU memory has not been freed. If you find this symptom, please contact us at qa@abci.ai. | 2021/08/12 Close. This issue has been resolved. |
2021/05/17 | MPI | With Open MPI 4.0.5, an MPI program execution using 66 nodes or more will fail. If you use 66 nodes or more, please set the MCA parameters plm_rsh_no_tree_spawn to true and plm_rsh_num_concurrent to $NHOSTS when invoking the executable: $ mpirun -mca plm_rsh_no_tree_spawn true -mca plm_rsh_num_concurrent $NHOSTS ./a.out | 2021/05/31 Close. Modified the default values of these MCA parameters. |
2020/09/30 | Singularity | SingularityPRO on ABCI has the following security issues (CVE-2020-25039, CVE-2020-25040). The issues affect the use of SingularityPRO on the interactive nodes and in jobs that use resource types other than Full. Users are recommended to use SingularityPRO only with the Full resource type until it is updated. | 2020/10/09 Close. Updated to the fixed version, 3.5-4. |
2020/01/14 | Cloud Storage | The amount of object data becomes inconsistent when users of other groups put or delete objects in a bucket to which they have been granted write permission by ACL. As a result, the ABCI points to be consumed are not calculated correctly. | 2020/04/03 Close. Updated to the fixed version. |
2019/11/14 | Cloud Storage | Due to a bug in the object storage, the following error messages are output when overwriting or deleting objects that were stored in multiple parts. [Overwrite] upload failed: object to s3://mybucket/object An error occurred (None) when calling the CompleteMultipartUpload operation: undefined [Delete] delete failed: s3://mybucket/object An error occurred (None) when calling the DeleteObject operation: undefined When you use the s3 command of the AWS CLI, a large file is stored in multiple parts. If you upload a large file, please refer to this page and set multipart_threshold to a large value (see the sketch after this table). | 2019/12/17 Close. |
2019/10/04 | MPI | MPI_Allreduce provided by MVAPICH2-GDR 2.3.2 raises floating point exceptions in the following combinations of nodes, GPUs, and message sizes when reduction between GPU memories is conducted. • Nodes: 28, GPU/Node: 4, Message size: 256KB • Nodes: 30, GPU/Node: 4, Message size: 256KB • Nodes: 33, GPU/Node: 4, Message size: 256KB • Nodes: 34, GPU/Node: 4, Message size: 256KB | 2020/04/21 Close. Updated to the fixed version. |
2019/04/10 | Job | The following qsub option now requires an argument to be specified due to the job scheduler update (8.5.4 -> 8.6.3): resource type (-l rt_F etc.) $ qsub -g GROUP -l rt_F=1 $ qsub -g GROUP -l rt_G.small=1 | Close. |
2019/04/10 | Job | The following qsub option now requires an argument to be specified due to the job scheduler update (8.5.4 -> 8.6.3): use BeeOND (-l USE_BEEOND) $ qsub -g GROUP -l rt_F=2 -l USE_BEEOND=1 | Close. |
2019/04/05 | Job | Due to the job scheduler update (8.5.4 -> 8.6.3), a compute node can execute only up to 2 jobs of each of the resource types "rt_G.small" and "rt_C.small" (normally up to 4 jobs). This also occurs with the Reservation service, so be careful when you submit jobs with "rt_G.small" or "rt_C.small"; in the following example, the third job stays in the qw state. $ qsub -ar ARID -l rt_G.small=1 -g GROUP run.sh (x 3 times) $ qstat job-ID prior name user state -------- 478583 0.25586 sample.sh username r 478584 0.25586 sample.sh username r 478586 0.25586 sample.sh username qw | 2019/10/04 Close. |
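
The 2022/04/06 Singularity entry recommends the --fakeroot option as an alternative to the remote build feature. A minimal sketch of that workaround follows; the definition file and image names are example values, not part of the original notice.

```
# Build the container image locally with --fakeroot instead of the
# (then unavailable) remote build feature. File names are examples.
$ singularity build --fakeroot myimage.sif myimage.def
```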
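
For the 2021/12/17 Application entry, a minimal job script sketch showing where the NCCL_IB_DISABLE workaround goes; the resource request and the training command are placeholders and should be replaced with those of your actual job.

```
#!/bin/bash
#$ -l rt_F=1    # placeholder resource request; use the resource type of your job
#$ -cwd

# Workaround from the entry above: disable NCCL's InfiniBand transport.
export NCCL_IB_DISABLE=1

# Placeholder for the actual distributed training command.
python3 train.py
```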
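
For the 2019/11/14 Cloud Storage entry, one way to raise multipart_threshold is through the AWS CLI configuration; the 5GB value below is only an illustration, and the value recommended on the referenced page should take precedence.

```
# Files smaller than this threshold are uploaded as a single object,
# avoiding the multipart path affected by the bug. 5GB is an example value.
$ aws configure set default.s3.multipart_threshold 5GB
```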